Chapter 1 - Introduction to Open Data Science

About the project

This is my course diary for the Introduction to Open Data Science (IODS) course; the links to my GitHub repository and diary are at the end of this chapter.

date()
## [1] "Mon Dec  4 19:57:23 2023"


There seem to be many new things to learn, especially GitHub. As mentioned, I expect my GitHub skills to improve during the course, along with my skills in analysing large data sets with R.

The Exercise1.Rmd file had a nice summary of data handling in R in chapters 1 & 2. Some of it I was already familiar with, but there was still a lot to learn; chapter 2 in particular had nice suggestions and tricks in its examples for handling data frames.

Using RStudio still feels quite fine; pushing files to GitHub is still a bit unclear, and I hope to get more comfortable with it as the course goes on. Exercise set 1 was quite extensive; most of it was familiar, but it never hurts to refresh your memory.

GitHub repository: https://github.com/venkatharina/IODS-project
GitHub diary: https://venkatharina.github.io/IODS-project/


Chapter 2 - Regression and model validation

Week 2: Regression and model validation

date() 
## [1] "Mon Dec  4 19:57:23 2023"

2) Analysis exercises

newfile <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt"), sep = ",", header = TRUE)
dim(newfile) #166 rows and 7 columns
## [1] 166   7
head(newfile)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
colnames(newfile) 
## [1] "gender"   "age"      "attitude" "deep"     "stra"     "surf"     "points"
#column names:"gender","age","attitude","deep","stra","surf","points"

###
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

summary(newfile)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00
table(newfile$gender)
## 
##   F   M 
## 110  56
ggpairs(newfile, lower = list(combo = wrap("facethist", bins = 20)))

#mean values: age 25.5, attitude 3.14, deep 3.68, stra 3.12, surf 2.79, points 22.7
#there are 110 females and 56 males
#there are significant correlations between surf-attitude*, points-attitude***, surf-deep*** and surf-stra*
#visually, attitude, stra and surf look approximately normally distributed

#choosing three variables: attitude, surf, stra
model <- lm(points ~ attitude + surf + stra, data = newfile)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = newfile)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
#attitude (p = 1.93e-08 ***) is highly significant in the model,
#whereas surf and stra are not significant in explaining points
#the model explains ~21% of the variance in points (multiple R-squared 0.207)

#if we remove variables surf and stra we get model2
model2 <- lm(points ~ attitude, data = newfile)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = newfile)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
#attitude is a significant predictor of points, but explains only ~19% of the variance (multiple R-squared 0.191)
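As a sanity check on the R-squared values quoted above: multiple R-squared is simply 1 - RSS/TSS. A small self-contained sketch on made-up toy data (the vectors `x` and `y` below are invented for illustration only):

```r
# Toy check: multiple R-squared equals 1 - RSS/TSS
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares
manual_r2 <- 1 - rss / tss

all.equal(manual_r2, summary(fit)$r.squared)  # TRUE
```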

#plotting models
par(mfrow = c(2,2))
plot(model)

par(mfrow = c(2,2))
plot(model2)

#overall the plots support the linear model assumptions
#top left: residuals vs fitted; residuals scatter around zero without a pattern (fitted values are the model's predictions, residuals what the predictions miss)
#top right: Q-Q plot; the residuals follow the line closely, so they look normally distributed
#bottom left: scale-location; the absolute residuals have similar magnitudes across fitted values (constant variance)
#bottom right: residuals vs leverage; no points combine a large residual with high leverage, so no single observation dominates the fit

Chapter 3 - Logistic regression

Week 3: Logistic regression

date() 
## [1] "Mon Dec  4 19:57:26 2023"

2) Analysis exercises

#reading table
a <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"), sep = ",", header = TRUE)
#looks the same as the table I produced in the data wrangling exercises
dim(a) #dimension of table: 370 observations and 35 variables
## [1] 370  35
str(a) #structure of table
## 'data.frame':    370 obs. of  35 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
colnames(a) #column names of table
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
#There are 35 variables, with logical, integer, numeric and character vectors
#The data concern student performance in secondary education and its relation to alcohol consumption
#The attributes include student grades and demographic, social and school-related features; the data were collected from school reports and questionnaires

#To study relations to low/high alcohol consumption I selected Fedu, studytime, romantic and absences
#My guess: higher parental (father's) education and high study time would go with low alcohol consumption,
#while romantic relationships and absences from school would go with high alcohol consumption

##############################################################################

c.names <- c("Fedu", "studytime", "romantic", "absences")
par(mfrow = c(2,2))
for (ii in seq_along(c.names)) {
  barplot(table(a[, c.names[ii]]), main = c.names[ii], col = "blue")
}

#Fedu (father's education): most fathers have at least some education, many a high level
#most students study around 2 h/week
#the majority are not in a romantic relationship
#the majority of absence counts are between 0 and 5

#alcohol consumption vs. father's education (Fedu)
table(a$alc_use,a$Fedu)
##      
##        0  1  2  3  4
##   1    2 25 42 37 34
##   1.5  0 15 15 18 15
##   2    0  9 18 15 14
##   2.5  0  7  9 13 12
##   3    0  8 10  8  6
##   3.5  0  5  5  1  6
##   4    0  1  1  5  2
##   4.5  0  1  1  0  1
##   5    0  2  4  0  3
plot(table(a$alc_use,a$Fedu))
#there does not seem to be a very clear pattern
#overall most students have an alcohol use score of 1, regardless of the father's education
#at the very high score of 5, the father's education tends to be 1-2

#alcohol consumption vs. studytime
table(a$alc_use,a$studytime)
##      
##        1  2  3  4
##   1   21 72 30 17
##   1.5 18 31 10  4
##   2   17 25 12  2
##   2.5 10 25  5  1
##   3   12 18  1  1
##   3.5 11  4  2  0
##   4    4  4  0  1
##   4.5  0  3  0  0
##   5    5  3  0  1
plot(table(a$alc_use,a$studytime))
#students who study little (category 1) appear more often at the high use scores (3.5-5)
#students who study a lot (category 4) seem to drink less overall
#the typical student is in study category 2 with an alcohol use score of 1

#alcohol consumption vs. romantic
table(a$alc_use,a$romantic)
##      
##       no yes
##   1   98  42
##   1.5 42  21
##   2   33  23
##   2.5 29  12
##   3   25   7
##   3.5 13   4
##   4    6   3
##   4.5  1   2
##   5    4   5
plot(table(a$alc_use,a$romantic))
#overall the majority has an alcohol use score of 1
#students in a romantic relationship seem to drink less,
#until the very high scores (4.5-5), where the pattern reverses

#alcohol consumption vs. absences
table(a$alc_use,a$absences)
##      
##        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 26 27 29
##   1   31 18 24 21 14  4 10  3  6  2  4  0  1  0  0  0  0  0  0  0  1  0  0  0
##   1.5 10 10  8  6  5  7  3  2  4  2  0  2  0  1  0  0  0  1  0  2  0  0  0  0
##   2    9  9  8  4  5  5  3  4  4  1  1  0  2  0  1  0  0  0  0  0  0  0  0  0
##   2.5  6  5  3  4  5  2  3  1  1  3  0  2  0  0  3  0  1  0  0  0  0  0  1  0
##   3    4  6  6  1  2  2  1  0  2  1  0  1  2  0  0  1  0  0  1  0  0  1  0  1
##   3.5  3  2  2  0  3  1  0  1  1  1  1  0  0  0  0  0  0  1  0  0  1  0  0  0
##   4    0  0  3  1  0  0  1  1  1  0  0  0  2  0  0  0  0  0  0  0  0  0  0  0
##   4.5  0  0  0  0  0  0  0  0  1  0  0  0  0  0  2  0  0  0  0  0  0  0  0  0
##   5    0  0  2  1  1  1  0  0  0  1  1  0  0  1  1  0  0  0  0  0  0  0  0  0
##      
##       44 45
##   1    0  1
##   1.5  0  0
##   2    0  0
##   2.5  1  0
##   3    0  0
##   3.5  0  0
##   4    0  0
##   4.5  0  0
##   5    0  0
plot(table(a$alc_use,a$absences))

#students who drink little (score 1) seem to have fewer absences
#students who drink a lot (scores 4-5) seem to have more absences
#as noted, the majority drinks little

#previously I predicted that high parental (father's) education and high study time would go with less drinking
#by eye, the father's education does not seem to have much of an effect,
#but high study time does seem to go with lower alcohol consumption
#I also predicted that romantic relationships and many absences would go with high alcohol consumption
#romantic relationships seem to work the opposite way, towards less alcohol,
#while few absences go with low and many absences with high alcohol consumption

##############################################################################

#doing logistic regression to high_use (high alcohol consumption)
#where Fedu, studytime, absences and romantic status are predictors

summary(a$high_use)#high_use FALSE 259, TRUE 111
##    Mode   FALSE    TRUE 
## logical     259     111
table(as.numeric(a$high_use),a$high_use) #FALSE 0, TRUE 1
##    
##     FALSE TRUE
##   0   259    0
##   1     0  111
#high_use outcome variable --> modeling TRUE answer

#NOTE: family = "binomial" is missing here, so glm() defaults to gaussian and
#actually fits a linear probability model, not a logistic regression
#(see the "Dispersion parameter for gaussian family" line in the output)
m=glm(high_use ~ Fedu + studytime + absences + romantic, data=a)
summary(m)
## 
## Call:
## glm(formula = high_use ~ Fedu + studytime + absences + romantic, 
##     data = a)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4473666  0.0848465   5.273 2.31e-07 ***
## Fedu        -0.0001055  0.0211342  -0.005 0.996018    
## studytime   -0.1023084  0.0272659  -3.752 0.000204 ***
## absences     0.0171176  0.0042214   4.055 6.13e-05 ***
## romanticyes -0.0474788  0.0493497  -0.962 0.336641    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1941708)
## 
##     Null deviance: 77.700  on 369  degrees of freedom
## Residual deviance: 70.872  on 365  degrees of freedom
## AIC: 450.54
## 
## Number of Fisher Scoring iterations: 2
#Fedu and romantic (yes) are not significant in this model
##Fedu would slightly decrease alcohol consumption, but not significantly
##being in a romantic relationship would decrease alcohol consumption, but not significantly
#studytime significantly decreases the probability of high alcohol consumption
#absences significantly increase the probability of high alcohol consumption

#I predicted that the father's education and study time would decrease alcohol consumption,
#and that romantic relationships and absences would increase it
#The father's education and study time both point towards decreased consumption, though the father's education not significantly
#Romantic relationships do not increase consumption; they decrease it, but not significantly
#Absences do increase high alcohol consumption, as I predicted in the beginning
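Since `family = "binomial"` was left out of the model above, here is what an actual logistic regression with the same predictors would look like (a sketch; the coefficients are then log-odds, and exponentiating them gives odds ratios, so the numbers will differ from the Gaussian fit above):

```r
# Refitting the same model as a true logistic regression
a <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"),
                sep = ",", header = TRUE)

m_logit <- glm(high_use ~ Fedu + studytime + absences + romantic,
               data = a, family = "binomial")
summary(m_logit)

# coefficients are on the log-odds scale; odds ratios are easier to read
exp(coef(m_logit))
```

An odds ratio above 1 means the predictor increases the odds of high use; below 1, it decreases them.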

##############################################################################

#creating a new regression model with only the significant variables
#(family = "binomial" is again missing, so this is also a linear probability model)
mm=glm(high_use ~ studytime + absences, data=a)
summary(mm)
## 
## Call:
## glm(formula = high_use ~ studytime + absences, data = a)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.43618    0.06491   6.720 6.95e-11 ***
## studytime   -0.10350    0.02720  -3.805 0.000166 ***
## absences     0.01669    0.00419   3.984 8.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1936026)
## 
##     Null deviance: 77.700  on 369  degrees of freedom
## Residual deviance: 71.052  on 367  degrees of freedom
## AIC: 447.48
## 
## Number of Fisher Scoring iterations: 2
#creating new table with probabilities and predictions of model
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc <- mutate(a, probability = predict(mm, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

#true values of high alcohol use
actual <- summary(alc$high_use)
#predicted values of high alcohol use with studytime and absences
prediction <- summary(alc$prediction)
act_pred <- cbind(actual,prediction)
act_pred
##       actual    prediction
## Mode  "logical" "logical" 
## FALSE "259"     "349"     
## TRUE  "111"     "21"
#actual: there are 259 who have low alcohol consumption, as 111 have high
#model: there are 349 who have low alcohol consumption, as 21 have high

#plotting the actual alcohol consumption rates and the predicted probabilities:
plot(alc$alc_use)
plot(alc$probability) #the predicted probabilities stay mostly below 0.5
#the model clearly underestimates high alcohol consumption

#2x2 crosstabular of actual values and predictions
table(alc$high_use,alc$prediction) 
##        
##         FALSE TRUE
##   FALSE   252    7
##   TRUE     97   14
#the cross-tabulation shows many actual TRUE cases predicted as FALSE (97 of 111)
#so the model does not fit very well
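The confusion matrix above can be condensed into the usual classification rates; a small sketch using the counts printed above (TN = 252, FP = 7, FN = 97, TP = 14):

```r
# Classification rates from the confusion matrix above
TN <- 252; FP <- 7    # low-use students: correctly / wrongly classified
FN <- 97;  TP <- 14   # high-use students: missed / correctly classified

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # share of correct predictions
sensitivity <- TP / (TP + FN)                   # share of high users found
specificity <- TN / (TN + FP)                   # share of low users found

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.719, sensitivity 0.126, specificity 0.973
```

The very low sensitivity makes the "underestimates high consumption" observation precise: only about 13% of the actual high users are caught.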

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = mm, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1] #model gives wrong predictions: 0.2864865 = ~29%
## [1] 0.2864865
#checking the prediction for studytime = 4, absences = 0:
predict(mm, type = "response",
        newdata = data.frame(studytime = 4, absences = 0))
##          1 
## 0.02218847
#model prediction: 0.02218847 = ~2%
#the students with studytime=4, absences=0 have alcohol consumption values 1.0, 2.0, 1.0 and 1.5 (scale 1-5)
(((1+2+1+1.5)/4)/5) #average, normalised to 0-1
## [1] 0.275
#real (normalised): 0.275 = ~28%
#classification-wise the model is right with these settings (real: FALSE, prediction: FALSE)

#checking the prediction for studytime = 1, absences = 27:
predict(mm, type = "response",
        newdata = data.frame(studytime = 1, absences = 27))
##         1 
## 0.7833464
#model prediction:0.7833464  = 78%
#when studytime=1, absences=27; alcohol consumption value 2.5 (scale:1-5)
(2.5/5) #average and normalised value
## [1] 0.5
#real: 0.5 = 50%
#model okay-ish with these settings (real:TRUE, prediction:TRUE)

#the model works okay to some extent, but there is a lot of error
#it would need more predictors to get better

##############################################################################

#10-fold cross-validation
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = mm, K = 10)
# average number of wrong predictions in the cross validation
cv10$delta[1] #model gives wrong predictions: 0.2837838 = ~28%
## [1] 0.2837838
#the model does not get better with 10-fold cross-validation
#this model has a higher prediction error than the one in Exercise3 (26%)
#with different/more predictors a better model could perhaps be found

##############################################################################

#making a new logistic regression model with lots of predictors:
m2=glm(high_use ~ age + traveltime + studytime + famrel + freetime + goout +
         health + failures + absences, data=a, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ age + traveltime + studytime + famrel + 
##     freetime + goout + health + failures + absences, family = "binomial", 
##     data = a)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.85188    2.06322  -1.867  0.06191 .  
## age          0.06714    0.11775   0.570  0.56855    
## traveltime   0.37155    0.18580   2.000  0.04553 *  
## studytime   -0.46960    0.17578  -2.671  0.00755 ** 
## famrel      -0.42247    0.14607  -2.892  0.00383 ** 
## freetime     0.18999    0.14557   1.305  0.19184    
## goout        0.70070    0.12886   5.438  5.4e-08 ***
## health       0.16616    0.09720   1.709  0.08736 .  
## failures     0.22758    0.23451   0.970  0.33181    
## absences     0.06848    0.02202   3.110  0.00187 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 358.70  on 360  degrees of freedom
## AIC: 378.7
## 
## Number of Fisher Scoring iterations: 4
#counting average number of wrong predictions
cv2 <- cv.glm(data = a, cost = loss_func, glmfit = m2, K = nrow(a))
cv2$delta[1] #0.2459459 average number of wrong predictions in the cross validation
## [1] 0.2459459
#adding predictions and probability to the table:
alc2 <- mutate(a, probability = predict(m2, type = "response"))
alc2 <- mutate(alc2, prediction = probability > 0.5)

#seeing actual values and predicted ones in one table
prediction2 <- summary(alc2$prediction)
act_pred2 <- cbind(actual,prediction2)
act_pred2
##       actual    prediction2
## Mode  "logical" "logical"  
## FALSE "259"     "294"      
## TRUE  "111"     "76"
#model2 (m2) matches the real class distribution more closely than model mm

#2x2 crosstabular of actual values and predictions:
table(alc2$high_use,alc2$prediction)
##        
##         FALSE TRUE
##   FALSE   232   27
##   TRUE     62   49
#there are more false positives (FALSE-TRUE: 27 vs 7), but also far more true positives (TRUE-TRUE: 49 vs 14)
#overall the model seems to be better

#adjusting the model according to significant predictors:
m3=glm(high_use ~ traveltime + studytime + famrel + goout + health + absences,
         data=a, family = "binomial")
summary(m3)
## 
## Call:
## glm(formula = high_use ~ traveltime + studytime + famrel + goout + 
##     health + absences, family = "binomial", data = a)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.35767    0.84403  -2.793  0.00522 ** 
## traveltime   0.38673    0.18207   2.124  0.03366 *  
## studytime   -0.52063    0.17144  -3.037  0.00239 ** 
## famrel      -0.40296    0.14244  -2.829  0.00467 ** 
## goout        0.77461    0.12334   6.280 3.38e-10 ***
## health       0.17958    0.09532   1.884  0.05957 .  
## absences     0.06861    0.02191   3.132  0.00174 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 362.02  on 363  degrees of freedom
## AIC: 376.02
## 
## Number of Fisher Scoring iterations: 4
#counting average number of wrong predictions:
cv3 <- cv.glm(data = a, cost = loss_func, glmfit = m3, K = nrow(a))
cv33 <- cv3$delta[1] #0.2378378 average number of wrong predictions in the cross validation

#adding predictions and probability to the table:
alc3 <- mutate(a, probability = predict(m3, type = "response"))
alc3 <- mutate(alc3, prediction = probability > 0.5)

#seeing actual values and predicted ones in one table:
prediction3 <- summary(alc3$prediction)
act_pred3 <- cbind(actual,prediction3)
act_pred3
##       actual    prediction3
## Mode  "logical" "logical"  
## FALSE "259"     "293"      
## TRUE  "111"     "77"
#model3 (m3) matches the real class distribution even more closely (better predictability)

#2x2 crosstabular of actual values and predictions:
table(alc3$high_use,alc3$prediction) 
##        
##         FALSE TRUE
##   FALSE   232   27
##   TRUE     61   50
#compared to m2, the false positives stay the same (27) and there is one fewer missed high user (61 vs 62)
#overall the model seems slightly better

#in both cases, a model with more (significant) predictors predicts better: the cross-validation error gets smaller and the cross-tabulations fit the actual classes better

#I did not know how to make a graph showing both training and testing errors by the number of predictors in the model
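One possible way to draw such a graph (a sketch, not a model answer: predictors are added one at a time in an arbitrary hand-picked order, training error is the loss function evaluated on the full data, and testing error comes from 10-fold cross-validation):

```r
library(boot)

a <- read.table(url("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"),
                sep = ",", header = TRUE)

loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# add predictors one at a time (order chosen by hand for illustration)
preds <- c("goout", "absences", "studytime", "famrel", "traveltime", "health")
train_err <- cv_err <- numeric(length(preds))

set.seed(123)  # the CV folds are random
for (k in seq_along(preds)) {
  f <- reformulate(preds[1:k], response = "high_use")
  fit <- glm(f, data = a, family = "binomial")
  train_err[k] <- loss_func(a$high_use, predict(fit, type = "response"))
  cv_err[k]    <- cv.glm(data = a, cost = loss_func, glmfit = fit, K = 10)$delta[1]
}

plot(seq_along(preds), train_err, type = "b", col = "blue",
     ylim = range(c(train_err, cv_err)),
     xlab = "number of predictors", ylab = "prediction error")
lines(seq_along(preds), cv_err, type = "b", col = "red")
legend("topright", legend = c("training", "10-fold CV"),
       col = c("blue", "red"), lty = 1)
```

Typically the training error keeps falling as predictors are added, while the cross-validation error levels off or rises once the model starts overfitting.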


Chapter 4 - Clustering and classification

Week 4: Clustering and classification

date() 
## [1] "Mon Dec  4 19:57:28 2023"

2) Analysis exercises

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)

#loading the data
data("Boston")

#exploring the dataset
dim(Boston) #506 rows and 14 variables
## [1] 506  14
str(Boston) #numeric and integer vectors
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
#The data come from a hedonic housing price study of the Boston metropolitan area,
#which used housing market data to estimate the willingness to pay for clean air
#Marginal air pollution damages (as revealed in the housing market) were found to increase with the level of air pollution and with household income

#graphical overview and summary of the data:
pairs(Boston) #scatterplot matrix of all variable pairs (plot() on a data frame gives the same)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
cor_matrix <- cor(Boston) 
corrplot(cor_matrix, method="circle")

#lots of variables; some barely correlate, while others correlate negatively or positively with each other

#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
#scale() standardised every column: each mean is now zero and each standard deviation one
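What `scale()` did here can be checked by hand: each column has its mean subtracted and is then divided by its standard deviation. A tiny toy check (the vector `x` is made up):

```r
# scale() standardises: (x - mean(x)) / sd(x), column by column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
round(manual, 3)
# -1.162 -0.387  0.387  1.162
```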

#`crim` (per capita crime rate by town)
#cutting the variable to put low, middle and high crime rates into their own categories
boston_scaled$crim <- as.numeric(boston_scaled$crim)
#using the quantiles as the break points (bins)
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#creating categorical variable
crime <- cut(boston_scaled$crim, breaks =bins, include.lowest = TRUE)
#removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
#adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
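The quantile-and-cut step above works like this on a toy vector (the labels here are made up for readability; the real `crime` factor keeps the interval names as labels):

```r
# How cut() with quantile breaks behaves, on a tiny made-up vector
v <- c(1, 2, 3, 4, 100)
q <- quantile(v)  # 0%, 25%, 50%, 75%, 100% -> 1, 2, 3, 4, 100

crime_demo <- cut(v, breaks = q, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
crime_demo
# [1] low      low      med_low  med_high high
```

`include.lowest = TRUE` is needed so the minimum value (here 1) falls into the first interval instead of becoming NA.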
#creating testing and training data (80/20 random split; no set.seed(), so the split varies between runs):
n <- nrow(Boston)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
#doing linear discriminant analysis
lda.fit = lda(crime ~ ., data=train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##       0.2475248       0.2400990       0.2549505       0.2574257 
## 
## Group means:
##                         zn      indus        chas        nox         rm
## [-0.419,-0.411]  0.8676773 -0.8743397 -0.19358706 -0.8798680  0.4309682
## (-0.411,-0.39]  -0.1291940 -0.2638612  0.01179157 -0.5462142 -0.1517527
## (-0.39,0.00739] -0.3856672  0.2864380  0.30103504  0.4469401  0.1130815
## (0.00739,9.92]  -0.4872402  1.0170690 -0.04518867  1.0863340 -0.4266874
##                        age        dis        rad        tax     ptratio
## [-0.419,-0.411] -0.8831228  0.8551820 -0.6981998 -0.7160272 -0.45198994
## (-0.411,-0.39]  -0.2385570  0.3213141 -0.5603720 -0.5091726 -0.03755605
## (-0.39,0.00739]  0.4094339 -0.4081168 -0.4243629 -0.2892640 -0.34898504
## (0.00739,9.92]   0.8074894 -0.8501113  1.6386213  1.5144083  0.78135074
##                       black       lstat        medv
## [-0.419,-0.411]  0.37181027 -0.74283960  0.50093960
## (-0.411,-0.39]   0.31180364 -0.09135079 -0.01387955
## (-0.39,0.00739]  0.02797664  0.02348579  0.22761592
## (0.00739,9.92]  -0.74922376  0.85231810 -0.68396515
## 
## Coefficients of linear discriminants:
##                 LD1          LD2        LD3
## zn       0.07573442  0.642582577 -0.8777281
## indus    0.06467029 -0.281799252  0.2842424
## chas    -0.09415488 -0.128864688  0.0796810
## nox      0.36913553 -0.785514540 -1.3149356
## rm      -0.08922996 -0.094113619 -0.2125112
## age      0.20688024 -0.264001262  0.2103381
## dis     -0.01899534 -0.256018753  0.2266716
## rad      3.39016563  0.923254217  0.1736943
## tax      0.04053785  0.009158699  0.1330706
## ptratio  0.12804937 -0.032002684 -0.1612895
## black   -0.10654682  0.076142431  0.1296544
## lstat    0.25488188 -0.247476685  0.2745129
## medv     0.22610267 -0.448557564 -0.2209189
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9542 0.0356 0.0102
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
#plotting the LDA results as a biplot
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

#saving the crime categories from the test set
correct_classes <- test$crime
#then removing the categorical crime variable from the test dataset
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##                  predicted
## correct           [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##   [-0.419,-0.411]              19              7               1              0
##   (-0.411,-0.39]                6             19               4              0
##   (-0.39,0.00739]               0             14               7              2
##   (0.00739,9.92]                0              0               0             23
#predictions are fully correct only for the highest-crime class (0.00739,9.92]
#the other classes are mostly right (majority on the diagonal), but the low and
#middle categories get confused with each other, so there is room for improvement
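As a sketch, the overall accuracy can be read straight off the cross-tabulation (the counts below are copied from the printed table; a different random split would give different numbers):

```r
# confusion matrix from the printed output (rows = correct, cols = predicted)
conf <- matrix(c(19,  7, 1,  0,
                  6, 19, 4,  0,
                  0, 14, 7,  2,
                  0,  0, 0, 23),
               nrow = 4, byrow = TRUE)
# accuracy = correctly classified (diagonal) / all test observations
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.667: about two thirds classified correctly
```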

#reloading the data
data("Boston")
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$dis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2658 -0.8049 -0.2790  0.0000  0.6617  3.9566
#calculating distances between the observations
distance <- dist(boston_scaled)
summary(distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#trying to identify the right number of clusters for k-means
set.seed(140)
k_max <- 10
#note: computing the WCSS on the scaled data, consistent with the clustering below
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
#(qplot() is deprecated since ggplot2 3.4.0, so using ggplot() directly)
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) + geom_line()

#the WCSS drops steeply at first and then levels off; around 6 clusters seems reasonable for k-means here
km <- kmeans(boston_scaled, centers = 6)
# plot the dataset with clusters
pairs(boston_scaled, col = km$cluster)
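The elbow heuristic used above can be sketched independently on another scaled data set (the built-in iris measurements as a stand-in; nothing here is Boston-specific):

```r
# total within-cluster sum of squares (WCSS) for k = 1..10;
# a sharp drop followed by a plateau suggests a good cluster count
set.seed(140)
X <- scale(iris[, 1:4])
k_max <- 10
# nstart = 20 reruns each k from several random starts for a stable WCSS
twcss <- sapply(1:k_max, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
twcss[1] > twcss[k_max]  # TRUE: WCSS always decreases as k grows
```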

############ BONUS ############

#reloading data:
data("Boston")
#scaling data:
boston_scaled <- as.data.frame(scale(Boston))
#k-means clustering with 8 clusters
km3 <- kmeans(boston_scaled, centers = 8)
boston_scaled$kmcluster <- as.numeric(km3$cluster)
#fitting LDA with the k-means clusters as the target classes
lda.fit2 <- lda(kmcluster ~ ., data = boston_scaled)
lda.fit2
## Call:
## lda(kmcluster ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##          1          2          3          4          5          6          7 
## 0.21343874 0.04150198 0.21541502 0.06521739 0.10869565 0.15415020 0.10276680 
##          8 
## 0.09881423 
## 
## Group means:
##         crim           zn      indus        chas        nox          rm
## 1 -0.3768106 -0.425505051 -0.3701112  0.09221725 -0.2286148 -0.35000772
## 2  3.2082952 -0.487240187  1.0149946 -0.27232907  1.0998477 -1.45566793
## 3 -0.4060502 -0.005364112 -0.6583486 -0.27232907 -0.8405622 -0.04480167
## 4  1.1193656 -0.487240187  1.0149946 -0.27232907  0.9950574 -0.19450805
## 5 -0.4146327  2.524683754 -1.2040537 -0.20074543 -1.2034679  0.73835915
## 6  0.4620310 -0.487240187  1.0149946  0.13147608  1.0021383 -0.15800782
## 7 -0.2727474 -0.487240187  1.5613334  0.10623826  1.1631724 -0.63230595
## 8 -0.3681800 -0.053323566 -0.7442735  0.59383298 -0.2416605  1.68533550
##          age        dis        rad         tax     ptratio      black
## 1  0.3376923 -0.1286995 -0.5799077 -0.53244743  0.22283672  0.2729966
## 2  1.0064301 -1.0710604  1.6596029  1.52941294  0.80577843 -0.1691195
## 3 -1.0244892  0.8700429 -0.5720054 -0.70802741 -0.07989337  0.3722489
## 4  0.7406815 -0.8461740  1.6596029  1.52941294  0.80577843 -3.2850509
## 5 -1.4205771  1.6272606 -0.6832698 -0.53843478 -0.89403340  0.3540831
## 6  0.6920432 -0.7470446  1.6596029  1.52941294  0.80577843  0.1640227
## 7  0.9324774 -0.9083143 -0.6130366  0.03111252 -0.35520300 -0.1409865
## 8  0.1056916 -0.2903328 -0.4926242 -0.78414273 -1.08156698  0.3392477
##        lstat        medv
## 1  0.1641254 -0.25817616
## 2  2.0102765 -1.39479170
## 3 -0.5625097  0.09778119
## 4  1.1195132 -1.03254705
## 5 -0.9678176  0.83523455
## 6  0.3945960 -0.31539874
## 7  0.6799267 -0.51668841
## 8 -0.9695286  1.72241105
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3          LD4          LD5
## crim     0.19364384  0.126100236 -0.20191652  0.376580103 -0.838531764
## zn      -0.05343615  1.094813039  1.50799254  1.133131460  0.197678623
## indus    0.55399704 -1.280702654  1.13241037  1.197258173  0.124603383
## chas    -0.03302284 -0.024772810 -0.14210875 -0.080434989  0.122406528
## nox      0.05435103 -0.424948658  0.26397352  0.771840326  0.285880375
## rm      -0.04813200  0.181941956  0.09971698 -0.277384152  0.689446855
## age      0.14103637 -0.651476207 -0.09019746 -0.008315478  0.502754473
## dis     -0.33219936  0.286780273  0.03476554  0.281273725 -0.252480041
## rad      5.93891512  1.865585385 -1.41594980 -0.744695993  0.427251550
## tax     -0.05791010  0.519204288  0.49438996  0.580198607  0.087932891
## ptratio  0.21221433 -0.011384841  0.01954111 -0.048612579 -0.116143664
## black   -0.19991944  0.269843607 -1.52326403  1.543950260 -0.002263929
## lstat    0.07512642 -0.006494829  0.08239517 -0.043407190 -0.209298121
## medv    -0.08095390  0.223086304 -0.03618324  0.137450228  0.487212102
##                  LD6         LD7
## crim     1.055793974 -0.76410951
## zn       0.472879553  0.75684606
## indus   -0.908040391 -1.05671771
## chas     0.139637868  0.18984158
## nox     -0.186010320 -0.32881295
## rm       0.077931268 -0.31058189
## age      0.573790467  0.91403286
## dis     -0.721069945 -0.66003474
## rad     -0.851480550 -0.28819620
## tax      0.245047274  0.77133610
## ptratio  0.008931086  0.50612180
## black   -0.001693988  0.02201946
## lstat    0.741261471 -0.23456713
## medv     0.594939439 -0.54523880
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5    LD6    LD7 
## 0.7632 0.1051 0.0506 0.0403 0.0205 0.0141 0.0062
#plotting the LDA biplot of the k-means clusters
plot(lda.fit2, dimen = 2)
lda.arrows(lda.fit2, myscale = 1)

############ SUPERBONUS ############

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
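The matrix product above is the LDA projection done by hand; a sketch on the built-in iris data (MASS ships with R) shows it matches what `predict()` computes, up to a constant centering shift in each discriminant:

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)
manual <- as.matrix(iris[, 1:4]) %*% fit$scaling  # uncentered projection
auto   <- predict(fit)$x                          # predict() centers before projecting
# the two differ only by a constant shift per LD column
shift_range <- apply(manual - auto, 2, function(col) max(col) - min(col))
shift_range  # effectively zero for every column
```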
#Next, install and access the plotly package to create a 3D plot of the columns of the matrix product.
#install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
#note: color must be the crime factor itself, not the quoted string 'train$crime'
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime, colors=c('red','green','blue','violet'))

Chapter 5 - Dimensionality reduction techniques

Week 5: Dimensionality reduction techniques

date() 
## [1] "Mon Dec  4 19:57:32 2023"

2) Analysis exercises

library(dplyr)
library(readr)
library(corrplot)
library(tibble)
library(GGally)

#reading data in
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#moving the country names to rownames
human_ <- column_to_rownames(human, "Country")

#visualisation of data:
ggpairs(human_, progress = FALSE)

#summary of data:
summary(human_)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#correlation matrix and its visual representation:
cor(human_)
##                Edu2.FM      Labo.FM   Life.Exp     Edu.Exp         GNI
## Edu2.FM    1.000000000  0.009564039  0.5760299  0.59325156  0.43030485
## Labo.FM    0.009564039  1.000000000 -0.1400125  0.04732183 -0.02173971
## Life.Exp   0.576029853 -0.140012504  1.0000000  0.78943917  0.62666411
## Edu.Exp    0.593251562  0.047321827  0.7894392  1.00000000  0.62433940
## GNI        0.430304846 -0.021739705  0.6266641  0.62433940  1.00000000
## Mat.Mor   -0.660931770  0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415  0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F    0.078635285  0.250232608  0.1700863  0.20608156  0.08920818
##              Mat.Mor  Ado.Birth     Parli.F
## Edu2.FM   -0.6609318 -0.5294184  0.07863528
## Labo.FM    0.2404611  0.1201589  0.25023261
## Life.Exp  -0.8571684 -0.7291774  0.17008631
## Edu.Exp   -0.7357026 -0.7035649  0.20608156
## GNI       -0.4951623 -0.5565621  0.08920818
## Mat.Mor    1.0000000  0.7586615 -0.08944000
## Ado.Birth  0.7586615  1.0000000 -0.07087810
## Parli.F   -0.0894400 -0.0708781  1.00000000
corrplot(cor(human_))

#description of data:
#Parli.F and Labo.FM have the weakest correlations with the other variables
#Edu2.FM correlates positively with Life.Exp, Edu.Exp and GNI and
#negatively with Mat.Mor and Ado.Birth
#Life.Exp correlates positively with Edu2.FM, Edu.Exp and GNI and
#negatively with Mat.Mor and Ado.Birth
#Overall, higher female education is associated with lower adolescent birth
#rates and lower maternal mortality

#Principal component analysis (PCA) with biplot
pca_human <- prcomp(human_)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

#Principal component analysis (PCA) with biplot with standardized data
human_std <- scale(human_)
pca_human2 <- prcomp(human_std)
pca_human2
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110  0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424 -0.2625067
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819 -0.1628935
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294 -0.1659678
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675  0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614  0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763  0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293  0.1523699
##                   PC6         PC7         PC8
## Edu2.FM   -0.17713316  0.05773644  0.16459453
## Labo.FM    0.03500707 -0.22729927 -0.07304568
## Life.Exp   0.42242796 -0.43406432  0.62737008
## Edu.Exp    0.38606919  0.77962966 -0.05415984
## GNI       -0.11120305 -0.13711838 -0.16961173
## Mat.Mor   -0.17370039  0.35380306  0.72193946
## Ado.Birth  0.76056557 -0.06897064 -0.14335186
## Parli.F   -0.13749772  0.00568387 -0.02306476
biplot(pca_human2, choices = 1:2)

#the standardized biplot is much more readable than the non-standardized one
#non-standardized: GNI has by far the largest raw variance, so PC1 is almost
#entirely GNI; it looks higher in countries like Singapore and Norway, but the
#remaining variables are impossible to tell apart

#standardized, reading the biplot from the center:
#upwards: more females in parliament (Parli.F) and higher female labour-force
#participation (Labo.FM), in countries like Bolivia and Tanzania
#left: higher expected education, GNI, female education and life expectancy
#(countries like Singapore, Ireland, Canada)
#right: higher maternal mortality and adolescent birth rates (countries like
#Sierra Leone, Gambia, Burkina Faso)

#Personal opinion:
#In countries where female education is higher, GNI and life expectancy are
#also higher, while adolescent birth rates and maternal mortality are lower,
#and female labour-force participation and parliamentary representation are
#more equal; many Western countries fall into this group
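The difference between the two biplots comes from PCA being variance-driven: without standardization, the variable with the largest raw variance (here GNI, measured in tens of thousands) dominates PC1. A toy sketch (hypothetical columns, not the human indicators):

```r
set.seed(3)
big   <- rnorm(100, sd = 1000)  # plays the role of GNI
small <- rnorm(100, sd = 1)     # plays the role of a 0-1 ratio
toy <- cbind(big, small)

# unscaled PCA: PC1 is essentially just the large-variance column
p_raw <- prcomp(toy)
abs(p_raw$rotation["big", "PC1"])    # very close to 1

# scaled PCA (same as prcomp(scale(toy))): both columns contribute equally,
# since the eigenvectors of a 2x2 correlation matrix are fixed at (1,1)/sqrt(2)
# and (1,-1)/sqrt(2)
p_std <- prcomp(toy, scale. = TRUE)
abs(p_std$rotation[, "PC1"])         # both about 0.707
```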

#Tea-dataset:

#300 individuals were asked how they drink tea (18 questions), how they
#perceive the product (12 questions) and some personal details (4 questions)

#loading data:
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
#visualisation of data:
view(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea) #300 rows and 36 variables
## [1] 300  36
#selecting certain columns (using all_of(), as the tidyselect deprecation warning advises):
library("FactoMineR")
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, all_of(keep_columns))
#plotting certain columns:
library(ggplot2)
library(tidyr)
gather(tea_time) %>% 
  ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust=1))
## Warning: attributes are not identical across measure variables; they will be
## dropped

#MCA analysis on partial data:
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
print(mca)
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 6 variables
## *The results are available in the following objects:
## 
##    name              description                       
## 1  "$eig"            "eigenvalues"                     
## 2  "$var"            "results for the variables"       
## 3  "$var$coord"      "coord. of the categories"        
## 4  "$var$cos2"       "cos2 for the categories"         
## 5  "$var$contrib"    "contributions of the categories" 
## 6  "$var$v.test"     "v-test for the categories"       
## 7  "$var$eta2"       "coord. of variables"             
## 8  "$ind"            "results for the individuals"     
## 9  "$ind$coord"      "coord. for the individuals"      
## 10 "$ind$cos2"       "cos2 for the individuals"        
## 11 "$ind$contrib"    "contributions of the individuals"
## 12 "$call"           "intermediate results"            
## 13 "$call$marge.col" "weights of columns"              
## 14 "$call$marge.li"  "weights of rows"
#plotting MCA analysis
plot(mca, invisible=c("ind"), graph.type = "classic")

#from the how variable: "tea bag" seems the most popular (near the center)
#from the How variable: tea taken "alone" is the most popular, followed by milk and lemon
#from the lunch variable: "Not.lunch" is clearly more common
#from the sugar variable: "sugar" and "No.sugar" are pretty even
#from the Tea variable: "Earl Grey" and "black" tea are more popular than "green"
#from the where variable: "chain store" is the most popular

#to summarize: people seem to prefer drinking tea outside lunch time,
#plain Earl Grey or black tea, with or without sugar,
#brewed from a tea bag bought from a chain store
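Under the hood, MCA is essentially a correspondence analysis of the complete disjunctive (indicator) table of the factors. A base-R sketch of building that table from toy factors (hypothetical data, not the tea set):

```r
toy <- data.frame(Tea   = factor(c("black", "green", "Earl Grey", "black")),
                  sugar = factor(c("sugar", "No.sugar", "No.sugar", "sugar")))

# one 0/1 indicator column per factor level
indicator <- do.call(cbind, lapply(toy, function(f) {
  m <- model.matrix(~ f - 1)  # -1 keeps all levels instead of dropping one
  colnames(m) <- levels(f)
  m
}))
indicator
# each row sums to the number of variables:
# every observation picks exactly one level per factor
rowSums(indicator)
```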

(more chapters to be added similarly as we proceed with the course!)